Sixt Data Science Lab - Test Task for Data Scientist Job Candidates

Introduction

In this test task you will have the opportunity to demonstrate your skills as a Data Scientist from various angles: processing data, analyzing and visualizing it, finding insights, applying predictive techniques, and explaining your reasoning.

The task is based around a bike sharing dataset openly available at UCI Machine Learning Repository [1].

Please go through the steps below, build up the necessary code and comment on your choices.

About Dataset:

Abstract: This dataset contains the hourly and daily count of rental bikes between the years 2011 and 2012 in the Capital Bikeshare system, with the corresponding weather and seasonal information.

Data Set Information: Bike sharing systems are a new generation of traditional bike rentals in which the whole process of membership, rental and return has become automatic. Through these systems, a user is able to easily rent a bike from a particular position and return it at another position. Currently, there are over 500 bike-sharing programs around the world, comprising over 500 thousand bicycles. Today, there is great interest in these systems due to their important role in traffic, environmental and health issues.

Apart from interesting real-world applications of bike sharing systems, the characteristics of the data generated by these systems make them attractive for research. As opposed to other transport services such as bus or subway, the duration of travel and the departure and arrival positions are explicitly recorded in these systems. This feature turns a bike sharing system into a virtual sensor network that can be used for sensing mobility in the city. Hence, it is expected that most important events in the city could be detected by monitoring these data.

Attribute Information:

Both hour.csv and day.csv have the same fields, except for hr, which is not available in day.csv.

Part 1 - Data Loading and Environment Preparation

Tasks:

  1. Prepare a Python 3 virtual environment (with the virtualenv command). A requirements.txt file (the output of the pip freeze command) should be included as part of your submission.
  2. Load the data from UCI Repository and put it into the same folder with the notebook. The link to it is https://archive.ics.uci.edu/ml/datasets/bike+sharing+dataset .
  3. Load the data into Python runtime as Pandas dataframe.
  4. Split the data into two parts: one dataset containing the last 30 days, and one dataset with the rest. You will need the dataset with the last 30 days in Part 5.

PART-1 Solution

TASKS 1.1 , 1.2

Task 1.3 : Loading dataset as pandas df:
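A minimal loading sketch. In the notebook the file would be read directly with `pd.read_csv("day.csv")`; here a tiny inline sample stands in for a few rows and columns of day.csv so the snippet is self-contained.

```python
import io
import pandas as pd

# In the notebook this would simply be:
#   day_df = pd.read_csv("day.csv", parse_dates=["dteday"])
# The inline sample below mimics a few columns of day.csv for illustration.
sample_csv = io.StringIO(
    "instant,dteday,season,yr,mnth,holiday,weekday,workingday,cnt\n"
    "1,2011-01-01,1,0,1,0,6,0,985\n"
    "2,2011-01-02,1,0,1,0,0,0,801\n"
    "3,2011-01-03,1,0,1,0,1,1,1349\n"
)
day_df = pd.read_csv(sample_csv, parse_dates=["dteday"])
print(day_df.shape)              # (3, 9)
print(day_df["dteday"].dtype)    # datetime64[ns]
```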

TASK 1.4 Data Split

Split the data into two parts. One dataset containing the last 30 days and one dataset with the rest.
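One way to sketch the split, assuming a datetime column named `dteday` (as in day.csv); the 100-day frame below is a hypothetical stand-in for the real data.

```python
import pandas as pd

# Hypothetical frame standing in for day.csv: 100 consecutive days of counts.
df = pd.DataFrame({
    "dteday": pd.date_range("2012-09-23", periods=100, freq="D"),
    "cnt": range(100),
})

# Cutoff: every record within 30 days of the last recorded date goes into
# the hold-out set used later in Part 5; everything else is kept for training.
cutoff = df["dteday"].max() - pd.Timedelta(days=30)
train_df = df[df["dteday"] <= cutoff]
last30_df = df[df["dteday"] > cutoff]

print(len(train_df), len(last30_df))  # 70 30
```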

Answers / comments / reasoning:

Part 2 - Data Processing and Analysis

Tasks:

  1. Perform all needed steps to load and clean the data. Please comment the major steps of your code.
  2. Visualise rentals of bikes per day.
  3. Assume that each bike can serve a maximum of exactly 12 rentals per day.
    • Find the maximum number of bicycles nmax that was needed in any one day.
    • Find the 95%-percentile of bicycles n95 that was needed in any one day.
  4. Visualize the distribution of the covered days depending on the number of available bicycles (e.g. nmax bicycles would cover 100% of days, n95 covers 95%, etc.)

TASK 2.1: Pre-Processing the Dataset

Task: Perform all needed steps to load and clean the data. Please comment the major steps of your code.

Step 1: Dealing with Null Data if any:

Step 2: Data Type Conversions of Columns (if required):

Only one column, "Date of the day", is of object type; it is converted to datetime format. All other columns are already category encoded, as mentioned in the dataset description.

Step 3: Dealing With Outliers if any:

Step 4: Dealing with Duplicated Values (if any):

There are no duplicated instances in our data.

Step 5: Checking for correlations:

From the correlation matrix, the most highly correlated variables can be observed:

and others
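The correlation check can be sketched as below; the four-column frame is hypothetical, standing in for the numeric columns of the real dataset (in day.csv, temp and atemp are known to be strongly correlated).

```python
import numpy as np
import pandas as pd

# Hypothetical numeric subset; in the notebook this would be the full frame.
df = pd.DataFrame({
    "temp":  [0.2, 0.3, 0.5, 0.7, 0.8],
    "atemp": [0.21, 0.31, 0.48, 0.69, 0.79],
    "hum":   [0.8, 0.7, 0.6, 0.5, 0.4],
    "cnt":   [1000, 1500, 2500, 3500, 4000],
})

corr = df.corr()
# temp and atemp track each other almost exactly, so their correlation is ~1,
# while humidity moves against the rental count here.
print(corr.loc["temp", "atemp"].round(2))
print(corr.loc["cnt", "hum"].round(2))
```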

Step 6: Checking Independent Variable Correlation with Target Variable:

Hence, the columns with the lowest correlation with the target variable are:

All other variables carry correlation values greater than 0.2 and hence have a meaningful impact on the rental bike count.

Answers / comments / reasoning:


TASK 2.2 : Visualise rentals of bikes per day:

For columns in the dataset:

Hover over the graph for date-wise information; you can use the slider to filter to month-specific graph information.

TASK 2.3

Assume that each bike can serve a maximum of exactly 12 rentals per day.

$$ 12 \text{ rentals} \longrightarrow 1 \text{ bike} $$

Number of bikes required per day, where $x$ is the number of rides for that day:

$$ x \text{ rentals} \longrightarrow \frac{x}{12} \text{ bikes} $$

Hence,
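A minimal sketch of this computation, using hypothetical daily counts in place of the real `cnt` column, and rounding up since a fractional bike cannot serve a ride:

```python
import numpy as np

# Hypothetical daily rental counts; in the notebook these come from the
# 'cnt' column of day.csv. At most 12 rentals per bike per day:
cnt = np.array([960, 1200, 2400, 3600, 6000, 8712])
bikes_per_day = np.ceil(cnt / 12).astype(int)

nmax = bikes_per_day.max()                   # bikes covering the busiest day
n95 = int(np.percentile(bikes_per_day, 95))  # bikes covering 95% of days

print(nmax, n95)
```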

PART 2.4 Covered Days

Task: Visualize the distribution of the covered days depending on the number of available bicycles (e.g. nmax bicycles would cover 100% of days, n95 covers 95%, etc.)

Hence,

i.e. if bikes_req = 727, that gives 100% coverage; 2 bikes give 0% coverage.

Calculating Percentile Rank for all the values in our dataset:
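The coverage curve can be sketched as follows; `bikes_req` is a hypothetical array of daily bike requirements, and coverage for a fleet size n is the percentage of days whose requirement does not exceed n.

```python
import numpy as np

# Hypothetical daily bike requirements (one value per day). For a candidate
# fleet size n, coverage is the fraction of days whose requirement is <= n.
bikes_req = np.array([100, 200, 300, 400, 500, 600, 700, 727])

def coverage(n, req=bikes_req):
    return (req <= n).mean() * 100  # percent of days fully covered

print(coverage(727))  # 100.0 -> the maximum requirement covers every day
print(coverage(400))  # 50.0  -> half the days need 400 bikes or fewer
```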

From the graph it is clearly visible:

Part 3 - Building prediction models

Tasks:

  1. Define a test metric for predicting the daily demand for bike sharing, which you would like to use to measure the accuracy of the constructed models, and explain your choice.
  2. Build a demand prediction model with Random Forest, preferably making use of following python libraries: scikit-learn.
  3. Report the value of the chosen test metric on the provided data.

Task: 3.1 : Choosing Test metric to Evaluate the accuracy of model

While there are many evaluation metrics to evaluate regression model performance, including:

I will be using RMSE because of the following reasons:

Though I will rely most on RMSE, I will use more than one evaluation metric to measure model accuracy more robustly. I will also test the model on the training data along with the test data, to observe overfitting, if any.
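A short sketch of the chosen metric on hypothetical true/predicted counts; RMSE penalises large errors more heavily than MAE and is expressed in the same units as the target (rented bikes), which makes it easy to interpret.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical true vs predicted daily counts.
y_true = np.array([1000.0, 2000.0, 3000.0])
y_pred = np.array([1100.0, 1900.0, 3300.0])

# RMSE = sqrt(mean of squared errors), here errors of 100, 100 and 300 bikes.
rmse = np.sqrt(mean_squared_error(y_true, y_pred))
print(round(rmse, 2))
```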

Task 3.2 Predicting daily demand of rides using Random Forest

Model 1: Random Forest Prediction Using all Independent Variables:
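A minimal scikit-learn sketch of the modelling step; the feature matrix below is synthetic (a noisy linear signal) standing in for the bike-sharing predictors such as temp, season and workingday.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Hypothetical feature matrix standing in for the bike-sharing predictors,
# with a noisy linear target in place of the real rental counts.
X = rng.random((200, 4))
y = 3000 * X[:, 0] + 500 * X[:, 1] + rng.normal(0, 50, 200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))  # R^2 on held-out data
```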

Model 2: Random Forest Prediction Using only High Correlation Independent Variables:

Task: 3.3 Evaluating the Model Predictions

Evaluating Model 1 Predictions

Evaluating Model 2 Predictions:

Answers / comments / reasoning:

Part 4 - Fine-tuning of one of the models

Tasks:

  1. Take one of the above constructed models and finetune its most important hyperparameters
  2. Explain your choice for the hyperparameters
  3. Report the improvement of your test metric

Task 4.1 : FineTuning Hyperparameters for Model 1:
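The tuning step can be sketched with GridSearchCV on synthetic data; n_estimators, max_depth and min_samples_split are the usual first targets for random-forest tuning, since they control ensemble size, tree complexity and when a node stops splitting. The grid values here are illustrative.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
# Hypothetical data standing in for the training split.
X = rng.random((120, 3))
y = 2000 * X[:, 0] + rng.normal(0, 30, 120)

param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [4, None],
    "min_samples_split": [2, 5],
}
search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    scoring="neg_root_mean_squared_error",  # higher (less negative) is better
    cv=3,
)
search.fit(X, y)
print(search.best_params_)
```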

Model3 : Hypertuned

Observations: Performance decreased, and there is still a huge difference between the test and train datasets.

Model 4: Hypertuned, increasing the number of trees, max_depth and split

The performance improved with 1000 estimators over 500, and with more depth and split. It means we need to find a value in between. However, 50 estimators are still the best performers.

Model 5: Hypertuning: average estimators and split, no bootstrapping

We can see that the bootstrapped dataset performed exceptionally well, as there is a huge decrease in performance with the non-bootstrapped dataset.

Model 6: Hypertuned with K-Fold Cross Validation

The error in the K-fold cross-validation case is larger than in the other cases; hence cross-validation does not perform well for our case.

Task 4.2 : Explaining Choice of Parameters:

Task 4.3: Reporting Improvements in Test Metric:

Without hypertuning (but removing less-correlated columns):

Best performer after hypertuning:

Answers / comments / reasoning:

-

Part 5 - Optimise (revenue - cost) by adapting number of bicycles

Tasks:

  1. Assume that the revenue per rental is x (your own assumed number).
  2. Each bicycle has costs of y per day (your own assumed number).
  3. Determine residuals from your test set (after predicting demand of bike sharing). Consider the residuals as random shocks affecting the average values and resulting in real observed values. Assume this random variable is gaussian distributed. Calculate mean and standard deviation and use it as approximation for a gaussian distribution where you can sample from.
  4. Simulate the profit with a fixed number of nmax (from part 2) bicycles for the next 30 days given that the real observed values are expected to be different from average predicted values. Calculate the demand by adding the simulated residuals to calculated expected values from the data you put aside in part 1.
  5. Use grid search along the number of available bikes to find the optimal number of bikes to obtain highest profit (revenue - cost) from simulations.

TASK 5.1, 5.2:

Task 5.1, 5.2: Assume that the revenue per rental is x (your own assumed number). Each bicycle has costs of y per day (your own assumed number).

Task 5.3: Residuals

Task:

Residual = Observed Value - Predicted Value

Task 5.3.1: Determining Residuals and Considering Them as Random Shocks

$$ \text{Residual} = \text{Observed Value} - \text{Predicted Value} $$

Task 5.3.2: Calculating Mean and Standard Deviation for the Residual Distribution (test data):

Task 5.3.3: Approximation of a new Gaussian distribution using the earlier calculated mean and standard deviation
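A sketch of this step with hypothetical residuals in place of the real test-set residuals: fit a Gaussian by its sample mean and standard deviation, then draw simulated shocks from it.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical residuals (observed - predicted) from the test set.
residuals = np.array([120.0, -80.0, 40.0, -60.0, 200.0, -150.0, 30.0, -100.0])

mu = residuals.mean()
sigma = residuals.std()

# Approximate the residual distribution with a Gaussian and sample shocks.
shocks = rng.normal(mu, sigma, size=10_000)
print(round(mu, 2), round(sigma, 2))
```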

Task 5.4:

Simulate the profit with a fixed number of nmax (from part 2) bicycles for the next 30 days given that the real observed values are expected to be different from average predicted values. Calculate the demand by adding the simulated residuals to calculated expected values from the data you put aside in part 1.

Task 5.4.1 : Calculating demand by adding simulated residuals:

Task 5.4.2: Calculating profit using nmax and the demand calculated with simulated residuals for 30 days:

$$ \text{profit} = \text{earnings} - \text{expenditure} $$

$$ \text{expenditure for 1 day} = y \cdot n_{max} = 500 \cdot n_{max} $$

$$ \text{earnings for 1 day} = x \cdot \text{day demand} = 100 \cdot \text{day demand} $$

(using the assumed values x = 100 revenue per rental and y = 500 cost per bicycle per day)
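The 30-day simulation can be sketched as follows. The expected demands, the residual spread and the assumed prices (x = 100 revenue per rental, y = 500 cost per bike per day) are all hypothetical stand-ins for the notebook's values; capacity is capped at 12 rentals per bike per day as assumed in Part 2.

```python
import numpy as np

rng = np.random.default_rng(1)

x = 100     # assumed revenue per rental
y = 500     # assumed cost per bicycle per day
nmax = 727  # fleet size from Part 2
days = 30

# Hypothetical expected daily demand (rentals) for the next 30 days, standing
# in for the model's predictions on the hold-out data from Part 1.
expected = rng.uniform(3000, 7000, size=days)
shocks = rng.normal(0, 1000, size=days)       # simulated Gaussian residuals
demand = np.clip(expected + shocks, 0, None)  # demand cannot be negative

# Each bike serves at most 12 rentals/day, so served rides cap at 12 * nmax.
served = np.minimum(demand, 12 * nmax)
profit = (x * served).sum() - y * nmax * days
print(round(profit))
```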

Use grid search along the number of available bikes to find the optimal number of bikes to obtain highest profit (revenue - cost) from simulations.
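The grid search over fleet sizes can be sketched like this, reusing the hypothetical prices and demand model from above and averaging the simulated 30-day profit over many runs for each candidate fleet size.

```python
import numpy as np

rng = np.random.default_rng(2)

x, y = 100, 500          # assumed revenue per rental / cost per bike per day
days = 30
expected = rng.uniform(3000, 7000, size=days)    # hypothetical predictions
shocks = rng.normal(0, 1000, size=(200, days))   # 200 simulated 30-day runs
demand = np.clip(expected + shocks, 0, None)

def mean_profit(n_bikes):
    served = np.minimum(demand, 12 * n_bikes)    # capacity cap per day
    return (x * served).sum(axis=1).mean() - y * n_bikes * days

# Grid search along the number of available bikes.
candidates = np.arange(100, 1001, 25)
profits = [mean_profit(n) for n in candidates]
best_n = candidates[int(np.argmax(profits))]
print(best_n, round(max(profits)))
```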

Predicting accuracy on grid search model:

Predicting for 30 days to calculate Profit with Optimal Bikes:

Highest Profit from Optimal Bikes :

Answers / comments / reasoning:

Hence, our profit improved by using the optimal number of bikes found by grid search.

$$ \text{Profit improvement} = 15 \text{ lakhs (in 30 days)} $$

Part 6 - Reflection / comments

Tasks: (Optional) Please share with us any free form reflection, comments or feedback you have in the context of this test task.

Task 6:

Submission

Please submit this notebook with your developments in .ipynb and .html formats as well as your requirements.txt file.

References

[1] Lichman, M. (2013). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.